Empowered by Innovation NEC


HOTCHIPS26 


SX-ACE Processor: 
NEC's Brand-New Vector Processor 


Shintaro Momose, Ph.D. 

NEC Corporation 

IT Platform Division 

Manager of SX vector supercomputer development 


August 11th, 2014 


SX History and Technical Evolutions 


NEC has consistently delivered high sustained
performance with its SX series of vector supercomputers.
Introduction

Trend of TOP500 (15% ~ 10% system)

Required Byte/Flop in Real Applications


According to a Japanese Government (MEXT) working group report, diverse
characteristics are observed across a wide variety of strategic application segments.
MEXT: Ministry of Education, Culture, Sports, Science & Technology


The B/F (Byte/Flop) requirement differs greatly from application to application.
No single architecture can cover all application areas.
[Figure: required memory capacity [PB] versus required memory bandwidth [Byte/Flop] for each application]
Reference: "Report on Strategic Direction/Development of HPC in Japan", March 2012


© NEC Corporation, 2014 / HOTCHIPS26


Concepts of SX-ACE 


The best solution for memory-intensive applications,
countering the scalar processor trend


Big Core 


Reducing the difficulty of massive parallelism with fewer cores


Low Power Consumption 


The best memory bandwidth solution 


Hybrid Solution 


Vector / Scalar tightly coupled environment 




Architecture 
Processor Overview


CORE

Clock frequency: 1.0GHz
SPU decode rate [instructions]
ADB size
ADB bandwidth

Memory bandwidth: 256GB/s
Memory capacity: 64GB
Interconnect: 8GB/s x2

Memory (DDR3)
Single Core Comparison


The SX-ACE core provides world top-level performance and
the largest memory bandwidth per core


[Figure: DP performance per core and memory bandwidth per core, normalized by SX-ACE, for SX-ACE, Xeon E5-4657L V2, Xeon Phi Coprocessor 7120P, Tesla K20X, PowerPC A2, and SPARC64 IXfx]
● Process rule: 28nm
● Clock speed: 1GHz
● Die size: 23.05 x 24.75mm
● # of transistors: 2B Tr.
● 16ch DDR3 I/F
● IXS 8GB/s x2
● 2ch PCIe x8 I/F
Core Architecture


256 operations = 16 parallel x 16 clock cycles
VPU 64GF
Vector instruction scheduler
256 elements
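
The relation "256 operations = 16 parallel x 16 clock cycles" can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the per-pipeline flop rate is derived from the 64GF and 1GHz figures on the slides, not stated by them.

```python
import math

VECTOR_LENGTH = 256   # elements per vector instruction (from the slide)
PARALLEL_PIPES = 16   # elements processed per clock cycle (from the slide)
CLOCK_GHZ = 1.0       # core clock frequency (from the slide)
PEAK_GFLOPS = 64.0    # VPU peak performance per core (from the slide)


def cycles_per_vector_instruction(elements, pipes=PARALLEL_PIPES):
    """Clock cycles needed to stream `elements` through `pipes` parallel pipelines."""
    return math.ceil(elements / pipes)


def flops_per_cycle(peak_gflops=PEAK_GFLOPS, clock_ghz=CLOCK_GHZ):
    """Floating-point operations completed per clock cycle at peak."""
    return peak_gflops / clock_ghz


print(cycles_per_vector_instruction(VECTOR_LENGTH))  # 16
# Derived value: flops per pipeline per cycle implied by 64GF at 1GHz.
print(flops_per_cycle() / PARALLEL_PIPES)            # 4.0
```

A shorter vector, for example 100 elements, would still occupy 7 cycles in this model, which is one reason long vectors use the pipelines more efficiently.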
■ A large SMP configuration can provide high sustained performance
■ But over 70% of the power was consumed by the memory network
■ The SX-ACE processor integrates the memory network into the LSI

                | SX-9 (1 node, 1.6TF) | SX-ACE (6 nodes, 1.5TF)
Number of LSI   | 560 LSI              | 6 LSI
Power           | 30kW                 | 2.8kW

SX-9 node: CPU (16 LSI, 16 cores) plus memory controller LSIs; many LSIs consuming more than 70% of the power
SX-ACE: number of LSI reduced to 1/100
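
The SX-9 versus SX-ACE comparison can be turned into ratios. This is a small sketch using only the figures read from the slide (1.6TF, 560 LSI, 30kW versus 1.5TF, 6 LSI, 2.8kW); the dictionary layout and function name are illustrative.

```python
# Figures as read from the slide; 560 is the SX-9 LSI count per node.
systems = {
    "SX-9 (1 node)": {"tflops": 1.6, "lsi": 560, "kw": 30.0},
    "SX-ACE (6 nodes)": {"tflops": 1.5, "lsi": 6, "kw": 2.8},
}


def tflops_per_kw(system):
    """Peak TFLOPS delivered per kilowatt of power."""
    return system["tflops"] / system["kw"]


sx9 = systems["SX-9 (1 node)"]
ace = systems["SX-ACE (6 nodes)"]

lsi_ratio = sx9["lsi"] / ace["lsi"]                      # roughly 1/100 the LSIs
power_ratio = sx9["kw"] / ace["kw"]                      # power reduction factor
efficiency_gain = tflops_per_kw(ace) / tflops_per_kw(sx9)

print(f"LSI count reduced by a factor of {lsi_ratio:.1f}")
print(f"Power reduced by a factor of {power_ratio:.1f}")
print(f"Peak TFLOPS per kW improved by a factor of {efficiency_gain:.1f}")
```

The LSI ratio of about 93 matches the "reduced to 1/100" statement, while the power and efficiency factors come out near 10 for almost the same peak performance.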
Memory Subsystem

[Figure: memory subsystem block diagram showing core #0 (VPU with 64GF vector pipelines, vector registers, load buffer, store buffer), cores #1 to #3, the MSHR, the RCU, and the 16B x 16 memory crossbar]
Reducing DRAM Energy


■ Cache line size
● DRAM activation power depends on the cache line size
● Sustained memory bandwidth is strongly affected by the adopted cache line size
■ Variable cache line size feature
● Supports 64B/128B memory access granularity
● 128B as the default to reduce power
● 64B for sparse memory accesses such as stride/indirect memory accesses

[Figure: DRAM power (0% to 100%) for 64B, 128B, and 256B cache line sizes, broken down into ACT, RD/WR, and BG; RD:WR 1:1, Micron DDR3 power calculator 0.96]
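
Why 64B granularity helps sparse accesses can be illustrated by counting the bytes fetched for a strided vector load. This is a simplified model, not a description of the hardware: it assumes 8B elements, a line-aligned base address, and that every touched line is fetched exactly once; the function names are hypothetical.

```python
ELEMENT_BYTES = 8  # one double-precision element


def bytes_fetched(n_elements, stride_elements, line_bytes, base=0):
    """Bytes moved from memory when each touched line is fetched once."""
    lines = {
        (base + i * stride_elements * ELEMENT_BYTES) // line_bytes
        for i in range(n_elements)
    }
    return len(lines) * line_bytes


def efficiency(n_elements, stride_elements, line_bytes):
    """Fraction of the fetched bytes that the program actually uses."""
    useful = n_elements * ELEMENT_BYTES
    return useful / bytes_fetched(n_elements, stride_elements, line_bytes)


for stride in (1, 16):
    for line in (64, 128):
        print(f"stride={stride:2d} line={line:3d}B "
              f"fetched={bytes_fetched(256, stride, line):6d}B "
              f"efficiency={efficiency(256, stride, line):.4f}")
```

In this model a contiguous 256-element load moves 2048 bytes at either granularity, while a stride of 16 elements moves 16384 bytes with 64B lines and 32768 bytes with 128B lines, so the smaller granularity halves the wasted traffic.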
Assignable Data Buffer (ADB)


■ On-chip cache for vector
● Private, 1MB, 4-way, 16-bank
● 256GB/s bandwidth per core
● Software controllable cache
● Customized for fast random access

■ Assignable feature
● A bypass flag in each instruction
● Compiler/user can control
● Avoiding cache pollution

■ MSHR feature
● Redundant memory requests that match an in-flight memory request are
held to reduce memory transactions

[Figure: ADB (1MB, 4-way, 16-bank) block diagram, with ST: 128GB/s]
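
The MSHR behavior described above, holding redundant requests that target a line already in flight, can be sketched as a toy model. The class and method names are hypothetical and the 128B line size is taken from the default granularity on the slides; real MSHR hardware has a fixed number of entries and more state than shown here.

```python
LINE_BYTES = 128  # default memory access granularity from the slides


class Mshr:
    """Toy miss-status holding register file: one entry per in-flight line."""

    def __init__(self, line_bytes=LINE_BYTES):
        self.line_bytes = line_bytes
        self.in_flight = {}            # line number -> waiting request ids
        self.memory_transactions = 0   # requests actually sent to memory

    def request(self, req_id, address):
        """Return True if a new memory transaction was issued, False if merged."""
        line = address // self.line_bytes
        if line in self.in_flight:
            # Same line is already being fetched: hold this request.
            self.in_flight[line].append(req_id)
            return False
        self.in_flight[line] = [req_id]
        self.memory_transactions += 1
        return True

    def fill(self, line):
        """Data for `line` arrived: release every request waiting on it."""
        return self.in_flight.pop(line, [])


mshr = Mshr()
addresses = [0, 8, 64, 128, 136, 0]
issued = [mshr.request(i, addr) for i, addr in enumerate(addresses)]
print(issued)                    # [True, False, False, True, False, False]
print(mshr.memory_transactions)  # 2
print(mshr.fill(0))              # [0, 1, 2, 5]
```

Six requests collapse into two memory transactions here because four of them fall into lines that are already in flight.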
Out-of-Order Vector Memory Access

[Figure: a vector memory access instruction covers elements 0 to 255 of 8B each (2KB = 8B x 256), issued as consecutive memory accesses by hardware]
Node Packaging


CPU
256GFlops = 64GFlops/core x 4
256GB/s memory bandwidth

Memory
16 DIMMs (DDR3 2000)
256GB/s, 64GB

Rated power consumption: 469W (outlet)

■ CPU: water cooling
■ Other components: air cooling
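
The node figures are consistent with simple channel arithmetic. The sketch below assumes that "DDR3 2000" means 2000 mega-transfers per second and that each DIMM sits on its own 64-bit (8-byte) channel, which matches the 16ch DDR3 interface on the chip slide; both readings are assumptions, not statements from the slides.

```python
DIMMS = 16                   # from the slide
TRANSFERS_PER_SEC = 2000e6   # assumption: DDR3-2000 = 2000 MT/s
BUS_BYTES = 8                # assumption: 64-bit data bus per DIMM channel
CORES = 4                    # from the slide
GFLOPS_PER_CORE = 64.0       # from the slide
MEMORY_GB = 64               # from the slide

per_dimm_gbs = TRANSFERS_PER_SEC * BUS_BYTES / 1e9   # bandwidth of one channel
node_gbs = per_dimm_gbs * DIMMS                      # aggregate node bandwidth
node_gflops = GFLOPS_PER_CORE * CORES                # peak node performance
gb_per_dimm = MEMORY_GB / DIMMS                      # capacity of each DIMM

print(per_dimm_gbs)   # 16.0
print(node_gbs)       # 256.0
print(node_gflops)    # 256.0
print(gb_per_dimm)    # 4.0
```

Under these assumptions the 16 channels add up to exactly the 256GB/s quoted for the node.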
Performance Evaluation

Performance Evaluation Conditions


Evaluation programs

Off-chip memory bandwidth          | STREAM (TRIAD)
Off/On-chip memory bandwidth       | Himeno Benchmark (highly memory intensive)
Indirect memory access performance | Legendre transformation


Each evaluation uses only compiler optimizations, without code modifications for individual systems


Performance comparison

System       | Performance        | Memory bandwidth | Rated system power
SX-9         | 102GF = 102GF x 1c | 256GB/s          | 1875W
SX-ACE       | 256GF = 64GF x 4c  | 256GB/s          | 469W
IVB (Xeon)   | 230GF = 19GF x 12c | 60GB/s           | 200W
Power7       | 245GF = 31GF x 8c  | 128GB/s          | 656W
FX10 (Sparc) | 234GF = 15GF x 16c | 85GB/s           | 281W


Power7 and FX10 were measured through joint research with Tohoku University
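
The comparison figures can be reduced to two derived metrics: Byte/Flop and memory bandwidth per watt. The sketch below uses only the peak values listed for each system, so it reflects nominal ratios rather than measured results; the function names are illustrative.

```python
# (name, peak GFLOPS, memory bandwidth GB/s, rated system power W)
systems = [
    ("SX-9", 102, 256, 1875),
    ("SX-ACE", 256, 256, 469),
    ("IVB (Xeon)", 230, 60, 200),
    ("Power7", 245, 128, 656),
    ("FX10 (Sparc)", 234, 85, 281),
]


def byte_per_flop(gflops, gbs):
    """Nominal memory bytes available per peak floating-point operation."""
    return gbs / gflops


def gbs_per_watt(gbs, watts):
    """Nominal memory bandwidth per watt of rated system power."""
    return gbs / watts


for name, gflops, gbs, watts in systems:
    print(f"{name:13s} B/F={byte_per_flop(gflops, gbs):.2f} "
          f"GB/s per W={gbs_per_watt(gbs, watts):.3f}")

best_bf = max(systems, key=lambda s: byte_per_flop(s[1], s[2]))[0]
best_bw_per_watt = max(systems, key=lambda s: gbs_per_watt(s[2], s[3]))[0]
```

With these nominal values SX-9 has the highest Byte/Flop, SX-ACE sits at exactly 1.0 B/F, and SX-ACE has the highest bandwidth per watt.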
Memory Bandwidth 1


Evaluation of off-chip memory bandwidth


■ Benchmark code: STREAM (TRIAD)

[Figure, left: sustained memory bandwidth versus # of cores used per processor (1 to 16) for SX-9, SX-ACE, IVB, Power7, and FX10]
[Figure, right: power efficiency (SX-ACE=1) for SX-9, SX-ACE, IVB, Power7, and FX10]

■ Only the SX-ACE single core can use the full memory bandwidth
■ SX-ACE provides the best memory bandwidth per watt
■ This can accelerate memory-intensive serial
parts in parallel processing
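
For readers unfamiliar with STREAM, the TRIAD kernel is the loop a[i] = b[i] + q * c[i], and the benchmark counts three 8-byte accesses per element (two loads and one store). The sketch below shows the kernel and the byte accounting in plain Python; the function names are illustrative, and the timing it reports reflects interpreter overhead, so it is not a meaningful bandwidth measurement.

```python
import time
from array import array


def triad(a, b, c, q):
    """STREAM TRIAD kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bytes(n, element_bytes=8):
    """Bytes counted by STREAM for TRIAD: two loads and one store per element."""
    return 3 * element_bytes * n


def measure(n=100_000, q=3.0):
    """Run TRIAD once and return the result array and the nominal GB/s."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, q)
    elapsed = time.perf_counter() - start
    return a, triad_bytes(n) / elapsed / 1e9


result, gbs = measure()
print(result[0])                   # 7.0
print(f"{gbs:.3f} GB/s (interpreter-bound, for illustration only)")
```

A compiled and vectorized version of the same loop is what the hardware comparison relies on, since TRIAD performs only two floating-point operations per 24 bytes moved.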
Memory Bandwidth 2


Evaluation of off/on-chip memory bandwidth


■ Benchmark code: Himeno benchmark (highly memory intensive),
solving the Poisson equation with the Jacobi iterative method

[Figure, left: sustained performance (SX-ACE=1)]
[Figure, right: power efficiency (SX-ACE=1)]

■ ADB and MSHR improve sustained memory bandwidth compared with its predecessor
■ SX-ACE is the best
■ SX-ACE is expected to provide 2~25x higher power efficiency for memory-intensive
applications with off/on-chip memory accesses
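
The Jacobi iterative method used by the Himeno benchmark can be illustrated on a small problem. The sketch below is not the Himeno code: the actual benchmark works on a 3D grid, whereas this is a simplified 2D five-point version with hypothetical function names, shown only to make the access pattern concrete.

```python
def make_grid(n, boundary, interior=0.0):
    """n x n grid with a fixed boundary value and a uniform interior guess."""
    grid = [[interior] * n for _ in range(n)]
    for k in range(n):
        grid[0][k] = grid[n - 1][k] = boundary
        grid[k][0] = grid[k][n - 1] = boundary
    return grid


def jacobi_sweep(u, f, h):
    """One Jacobi sweep for the 2D Poisson equation with a five-point stencil."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]  # boundary values are carried over unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]
                                - h * h * f[i][j])
    return new


def solve(u, f, h, sweeps):
    """Apply a fixed number of Jacobi sweeps and return the final grid."""
    for _ in range(sweeps):
        u = jacobi_sweep(u, f, h)
    return u


n = 8
u0 = make_grid(n, boundary=1.0)
f = [[0.0] * n for _ in range(n)]       # zero source term: Laplace equation
u = solve(u0, f, h=1.0 / (n - 1), sweeps=200)
print(abs(u[3][3] - 1.0) < 1e-6)        # True
```

Each update reads several neighboring values and performs only a handful of floating-point operations, which is why stencil codes of this kind are limited by memory bandwidth and benefit from on-chip reuse.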
Indirect Memory Access


Evaluation of indirect memory access performance


■ Benchmark code: Legendre transformation
■ Cache-effective benchmark (4.4MB data)

[Figure, left: sustained performance (SX-ACE=1) comparing SX-9, SX-ACE, IVB, Power7, and FX10]
[Figure, right: power efficiency (SX-ACE=1) comparing SX-9, SX-ACE, IVB, Power7, and FX10]

■ Cache is effective
■ ADB, MSHR, OoO, and short memory access latency work well
■ SX-ACE improvement provides 25x higher power efficiency than SX-9
■ But IVB is the best due to a larger cache and lower power consumption
Conclusions


■ Issue of modern scalar/accelerator processors
● Massively parallel with small cores
● Low memory bandwidth

■ SX-ACE direction
● Providing the big core with large memory bandwidth
● Improving proven vector architecture

■ SX-ACE processor
● 4-core vector processor
● 64GF core performance with 64-256GB/s memory bandwidth
● Efficient memory subsystem for higher sustained memory bandwidth

■ Performance
● High sustained performance and power efficiency for memory-intensive benchmarks